NSF PAR Search | NSF Public Access Repository

An Empirical Study of Microscaling Formats for Low-Precision LLM Training

Yang, Hanmei; Deng, Summer; Nagpal, Amit; Naumov, Maxim; Janani, Mohammad; Liu, Tongping; Guan, Hui (April 2025, 2025 IEEE 32nd Symposium on Computer Arithmetic (ARITH))

Free, publicly-accessible full text available April 17, 2026
Understanding and Alleviating Memory Consumption in RLHF for LLMs

Zhou, Jin; Yang, Hanmei; Tang, Steven; Xiang, Mingcan; Guan, Hui; Liu, Tongping (October 2024, Machine Learning for Systems Workshop at NeurIPS 2024)

Full Text Available
NUMAlloc: A Faster NUMA Memory Allocator

https://doi.org/10.1145/3591195.3595276

Yang, Hanmei; Zhao, Xin; Zhou, Jin; Wang, Wei; Kundu, Sandip; Wu, Bo; Liu, Tongping (June 2023, ACM SIGPLAN International Symposium on Memory Management)

The NUMA architecture accommodates the hardware trend of an increasing number of CPU cores. It requires the cooperation of memory allocators to achieve good performance for multithreaded applications. Unfortunately, existing allocators do not support NUMA architecture well. This paper presents a novel memory allocator, NUMAlloc, that is designed for the NUMA architecture. NUMAlloc is centered on a binding-based memory management. On top of it, NUMAlloc proposes an “origin-aware memory management” to ensure the locality of memory allocations and deallocations, as well as a method called “incremental sharing” to balance the performance benefits and memory overhead of using transparent huge pages. According to our extensive evaluation, NUMAlloc has the best performance among all evaluated allocators, running 15.7% faster than the second-best allocator (mimalloc), and 20.9% faster than the default Linux allocator with reasonable memory overhead. NUMAlloc is also scalable to 128 threads and is ready for deployment.
Full Text Available
NUMAlloc: A Faster NUMA Memory Allocator

https://doi.org/10.1145/3591195.3595276

Yang, Hanmei; Zhao, Xin; Zhou, Jin; Wang, Wei; Kundu, Sandip; Wu, Bo; Guan, Hui; Liu, Tongping (June 2023, ACM)

The NUMA architecture accommodates the hardware trend of an increasing number of CPU cores. It requires the cooperation of memory allocators to achieve good performance for multithreaded applications. Unfortunately, existing allocators do not support NUMA architecture well. This paper presents a novel memory allocator, NUMAlloc, that is designed for the NUMA architecture. NUMAlloc is centered on a binding-based memory management. On top of it, NUMAlloc proposes an “origin-aware memory management” to ensure the locality of memory allocations and deallocations, as well as a method called “incremental sharing” to balance the performance benefits and memory overhead of using transparent huge pages. According to our extensive evaluation, NUMAlloc has the best performance among all evaluated allocators, running 15.7% faster than the second-best allocator (mimalloc), and 20.9% faster than the default Linux allocator with reasonable memory overhead. NUMAlloc is also scalable to 128 threads and is ready for deployment.
Deadlock prediction via generalized dependency

https://doi.org/10.1145/3533767.3534377

Zhou, Jinpeng; Yang, Hanmei; Lange, John; Liu, Tongping (July 2022, Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis)

Deadlocks are notorious bugs in multithreaded programs, causing serious reliability issues. They are difficult to eliminate fully before deployment because their occurrence typically depends on specific inputs and thread schedules, so finding them requires the assistance of dynamic tools. However, existing deadlock detection tools mainly focus on locks and cannot detect deadlocks related to condition variables. This paper presents a novel approach to fill this gap. It extends the classic lock dependency to generalized dependency by abstracting the signal for the condition variable as a special resource so that communication deadlocks can be modeled as hold-and-wait cycles as well. It further designs multiple practical mechanisms to record and analyze generalized dependencies. Finally, this paper presents the implementation of the tool, called UnHang. Experimental results on real applications show that UnHang is able to find all known deadlocks and uncover two new deadlocks. Overall, UnHang imposes only around 3% performance overhead and 8% memory overhead, making it a practical tool for the deployment environment.
Full Text Available
CachePerf: A Unified Cache Miss Classifier via Hybrid Hardware Sampling

https://doi.org/10.1145/3547353.3526954

Zhou, Jin; Tang, Steven; Yang, Hanmei; Liu, Tongping (June 2022, ACM SIGMETRICS Performance Evaluation Review)

The cache plays a key role in determining the performance of applications, whether sequential or concurrent, on homogeneous and heterogeneous architectures. Fixing cache misses requires understanding their origin and type. However, this remains an unresolved issue even after decades of research. This paper proposes a unified profiling tool, CachePerf, that can correctly identify different types of cache misses, differentiate allocator-induced issues from those of applications, and exclude minor issues without much performance impact. The core idea behind CachePerf is a hybrid sampling scheme: it employs PMU-based coarse-grained sampling to select very few susceptible instructions (with frequent cache misses) and then employs breakpoint-based fine-grained sampling to collect the memory access pattern of these instructions. Based on our evaluation, CachePerf imposes only 14% performance overhead and 19% memory overhead (for applications with large footprints), while identifying the types of cache misses correctly. CachePerf detected 9 previously unknown bugs. Fixing the reported bugs achieves performance speedups from 3% to 3788%. CachePerf will be an indispensable complement to existing profilers due to its effectiveness and low overhead.
Full Text Available
A Convex Variational Model for Restoring SAR Images Corrupted by Multiplicative Noise

https://doi.org/10.1155/2020/1952782

Yang, Hanmei; Li, Jiachang; Shen, Lixin; Lu, Jian (June 2020, Mathematical Problems in Engineering)
This paper studies a new convex variational model for denoising and deblurring images with multiplicative noise. Considering the statistical property of multiplicative noise following a Nakagami distribution, the denoising model consists of a data fidelity term, a quadratic penalty term, and a total variation regularization term. Here, the quadratic penalty term is mainly designed to guarantee that the model is strictly convex under a mild condition. Furthermore, the model is extended to the simultaneous denoising and deblurring case by introducing a blurring operator. We also study some mathematical properties of the proposed model. In addition, the model is solved by applying the primal-dual algorithm. The experimental results show that the proposed method is promising in restoring (blurred) images with multiplicative noise.
Full Text Available
MemPerf: Profiling Allocator-Induced Performance Slowdowns

https://doi.org/10.1145/3622848

Zhou, Jin; Silvestro, Sam; Tang, Steven_Jiaxun; Yang, Hanmei; Liu, Hongyu; Zeng, Guangming; Wu, Bo; Liu, Cong; Liu, Tongping (October 2023, Proceedings of the ACM on Programming Languages)

The memory allocator plays a key role in the performance of applications, but none of the existing profilers can pinpoint performance slowdowns caused by a memory allocator. Consequently, programmers may spend time improving application code incorrectly or unnecessarily, achieving little or no performance improvement. This paper designs the first profiler, MemPerf, to identify allocator-induced performance slowdowns without comparing against another allocator. Based on the key observation that an allocator may impact the whole life-cycle of heap objects, including the accesses (or uses) of these objects, MemPerf proposes a life-cycle based detection to identify slowdowns caused by slow memory management operations and slow accesses separately. For the former, MemPerf proposes a thread-aware and type-aware performance modeling to identify slow management operations. For slow memory accesses, MemPerf utilizes a top-down approach to identify all possible reasons for slow memory accesses introduced by the allocator, mainly due to cache and TLB misses, and further proposes a unified method to identify them correctly and efficiently. Based on our extensive evaluation, MemPerf correctly reports 98% of medium and large allocator-induced slowdowns (larger than 5%) without reporting any false positives. MemPerf also pinpoints multiple known and unknown design issues in widely-used allocators.
